[Kernel] Enable fp8 support for pplx and BatchedTritonExperts. #18864

bnellnm · 2025-05-28T23:40:02Z

Enable full fp8 support for pplx and BatchedTritonExperts.

Replace the world_size/dp_size arguments to the PrepareAndFinalize and Experts constructors with num_dispatchers.
Reduce duplicated setup information, i.e. get the parameters from the FusedMoEConfig where possible rather than from all2all_manager or ad hoc variables.
Rewrite the pplx tests so that they run in a loop on the spawned process rather than spawning a process for each test point. The original slow test points can still be run with the --optional pytest flag.
Add more quantization tests to cover all the combinations of per-token, per-tensor and blocked quantization.
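
For readers less familiar with the three granularities named above, the following is a minimal pure-Python sketch of how the scale factors differ in shape. It is not code from this PR: the helper names are hypothetical, plain nested lists stand in for tensors, and the blocked variant assumes activations are grouped along the hidden dimension. The only fixed fact is that 448 is the largest finite value in float8_e4m3fn.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn


def _absmax(values):
    """Largest magnitude in an iterable, floored to avoid a zero scale."""
    return max(max((abs(v) for v in values), default=0.0), 1e-10)


def per_tensor_scale(x):
    """Per-tensor: one scale shared by every element of the 2-D input."""
    return _absmax(v for row in x for v in row) / FP8_E4M3_MAX


def per_token_scales(x):
    """Per-token: one scale per row (token)."""
    return [_absmax(row) / FP8_E4M3_MAX for row in x]


def blocked_scales(x, block_k):
    """Blocked (assumed layout): one scale per token per group of block_k elements."""
    return [
        [_absmax(row[i:i + block_k]) / FP8_E4M3_MAX
         for i in range(0, len(row), block_k)]
        for row in x
    ]


# Two tokens with a hidden size of 4 (illustrative values only).
x = [[1.0, -2.0, 448.0, 4.0],
     [0.5, 0.25, -0.125, 896.0]]

tensor_scale = per_tensor_scale(x)      # a single float
token_scales = per_token_scales(x)      # one float per token
group_scales = blocked_scales(x, 2)     # one float per token per 2-wide group
```

Dividing each element by its scale maps the largest magnitude in that tensor, row, or group onto the fp8 range, which is why the test matrix has to exercise each granularity separately.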

I've verified that all the combinations listed in this spreadsheet work properly: dispatch_combine fp8 support matrix by branch + model.xlsx
Tested with DP=2/TP=1, DP=2/TP=2 and DP=4/TP=1.

lm-eval results for RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic with pplx, DP=4, TP=1.

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.86|±  |0.0349|
|     |       |strict-match    |     5|exact_match|↑  | 0.81|±  |0.0394|

cc @ElizaWszola

mergify · 2025-05-28T23:40:37Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @bnellnm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: Bill Nell <bnell@redhat.com>

varun-sundar-rabindranath · 2025-07-02T21:20:38Z

LGTM! Really nice cleanups @bnellnm 🙌

bnellnm · 2025-07-02T21:21:44Z

> LGTM! Really nice cleanups @bnellnm 🙌

Thanks!

ElizaWszola · 2025-07-03T10:27:21Z

tests/kernels/moe/test_batched_moe.py

+@pytest.mark.parametrize("dtype", [torch.float8_e4m3fn, torch.bfloat16])
+@pytest.mark.parametrize("per_act_token_quant", [False, True])
+@pytest.mark.parametrize("block_shape", [None, [128, 128]])
+@pytest.mark.parametrize("input_scales", [False])


Why is this only False?

I've left it here for future testing.

I see. Should there also be a condition in the test code to skip the test if input_scales == True and quant_dtype is None?

That's one of the conditions that needs more testing. There are some int8/int4 quantization schemes that happen outside the Triton kernels, so they need to pass in the quantized data and scales but no quant_type, since the data is already quantized.
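
To make the parametrization discussion concrete, here is a small sketch that enumerates the matrix from the diff above and filters out combinations a test might skip. The filter rules are assumptions for illustration, not the checks in tests/kernels/moe/test_batched_moe.py; strings stand in for the torch dtypes so the snippet runs without torch, and `is_supported` is a hypothetical helper.

```python
from itertools import product

DTYPES = ["float8_e4m3fn", "bfloat16"]  # stand-ins for the torch dtypes
PER_ACT_TOKEN_QUANT = [False, True]
BLOCK_SHAPES = [None, (128, 128)]
INPUT_SCALES = [False]                  # True is left for future testing


def is_supported(dtype, per_act_token_quant, block_shape, input_scales):
    """Hypothetical filter; the real test may apply different rules."""
    quantized = dtype == "float8_e4m3fn"
    # Assumed rule: quantization options are meaningless without quantization.
    if not quantized and (per_act_token_quant or block_shape is not None):
        return False
    # Assumed rule: per-token and blocked scales are mutually exclusive.
    if per_act_token_quant and block_shape is not None:
        return False
    # input_scales is deliberately not used to reject a case: per the thread,
    # pre-quantized int8/int4 inputs may carry scales without a quant dtype.
    return True


all_cases = list(product(DTYPES, PER_ACT_TOKEN_QUANT, BLOCK_SHAPES, INPUT_SCALES))
cases = [c for c in all_cases if is_supported(*c)]

for case in cases:
    print(case)
```

Inside a pytest test, the same predicate could guard a `pytest.skip(...)` call at the top of the test body, so that skipped combinations remain visible in the report.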

ElizaWszola · 2025-07-03T12:52:53Z

vllm/model_executor/layers/fused_moe/cutlass_moe.py

@@ -178,6 +175,8 @@ def run_cutlass_moe_fp8(
        c2 = _resize_cache(workspace2, (M * topk, N))
        c3 = _resize_cache(workspace13, (M * topk, K))

+    c1.fill_(0)


Can we add a condition here so that c1 is only zeroed out if expert_map is not None and per_act_token == True? As far as I'm aware, this is the only case where it's needed.

There's another PR that has the proper condition for this. I don't want to have to rerun everything at this point. I'll let that other PR push the better fix.
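
The condition proposed in this thread can be sketched as follows. This only illustrates the reviewer's suggestion, which the thread itself presents as a belief rather than an established fact; the PR as merged zeroes c1 unconditionally. Nested lists stand in for the workspace tensor, and both helper names are hypothetical.

```python
def needs_zero_fill(expert_map, per_act_token):
    """Reviewer's proposed condition for clearing the c1 workspace."""
    return expert_map is not None and per_act_token


def prepare_workspace(c1, expert_map, per_act_token):
    """Zero the reused workspace in place only when the condition holds."""
    if needs_zero_fill(expert_map, per_act_token):
        for row in c1:
            row[:] = [0.0] * len(row)
    return c1


# A reused buffer that still holds stale values from an earlier call.
stale = [[7.0, -3.0], [float("inf"), 1.0]]

# No expert map: the buffer is left untouched.
untouched = prepare_workspace([row[:] for row in stale], None, True)

# Expert map present and per-token quantization on: the buffer is cleared.
cleared = prepare_workspace([row[:] for row in stale], [0, -1], True)
```

With a real tensor the in-place loop would be the single `c1.fill_(0)` call from the diff, placed behind the same predicate.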

mergify bot added the v1 label May 28, 2025

mergify bot added the needs-rebase label May 28, 2025

bnellnm force-pushed the batch-fp8 branch from fa64b5a to d86e3f0 Compare May 28, 2025 23:41

mergify bot removed the needs-rebase label May 28, 2025

tlrmchlsmth mentioned this pull request Jun 2, 2025

[Bugfix][EP+DP] Use pplx-kernel internode instead of intranode #19034

Merged

varun-sundar-rabindranath reviewed Jun 2, 2025

mergify bot added the needs-rebase label Jun 3, 2025

bnellnm force-pushed the batch-fp8 branch from 20881d7 to 680de26 Compare June 13, 2025 02:25

mergify bot removed the needs-rebase label Jun 13, 2025

mergify bot added the needs-rebase and qwen labels Jun 13, 2025

bnellnm force-pushed the batch-fp8 branch 2 times, most recently from 911339b to f92734e Compare June 24, 2025 21:09

bnellnm mentioned this pull request Jun 25, 2025

[Kernels] MoE refactor #19636

Merged

bnellnm force-pushed the batch-fp8 branch from 3d226f5 to 347cda2 Compare June 26, 2025 21:20

bnellnm marked this pull request as ready for review June 26, 2025 21:20

bnellnm requested review from tlrmchlsmth, WoosukKwon, mgoin and robertgshaw2-redhat as code owners June 26, 2025 21:20

mergify bot removed the needs-rebase label Jun 26, 2025

bnellnm changed the title ~~[Kernel] Fix fp8 support for pplx and BatchedTritonExperts.~~ [Kernel] Enable fp8 support for pplx and BatchedTritonExperts. Jun 26, 2025

mergify bot added the needs-rebase label Jun 26, 2025

bnellnm force-pushed the batch-fp8 branch from 347cda2 to 7219559 Compare June 27, 2025 03:13

mergify bot removed the needs-rebase label Jun 27, 2025

bnellnm added 3 commits July 2, 2025 16:04, each signed off by Bill Nell <bnell@redhat.com>:

a5c8e85 ping
285b2bc fix num_dispatchers for TP+DP
286d988 fix unit test